ABSTRACT

Complex information extraction (IE) pipelines assembled by plumb- 
ing together off-the-shelf operators, specially customized opera- 
tors, and operators re-used from other text processing pipelines are 
becoming an integral component of most text processing frame- 
works. A critical task faced by the IE pipeline user is to run a 
post-mortem analysis on the output. Due to the diverse nature of 
extraction operators (often implemented by independent groups), it 
is time consuming and error-prone to describe operator semantics 
formally or operationally to a provenance system. 

We introduce the first system that helps IE users analyze pipeline 
semantics and infer provenance interactively while debugging. This 
allows the effort to be proportional to the need, and to focus on the 
portions of the pipeline under the greatest suspicion. We present 
a generic debugger for running post-execution analysis of any IE 
pipeline consisting of arbitrary types of operators. We propose an 
effective provenance model for IE pipelines which captures a variety
of operator types, ranging from those for which full specifications
are available to those for which none are. We present a suite of algorithms to
effectively build provenance and facilitate debugging. Finally, we
present an extensive experimental study on large-scale real-world
extractions from an index of ~500 million Web documents.

1. INTRODUCTION 

A growing amount of knowledge is being made available in the
form of unstructured text documents such as web pages, email,
news articles, etc. Information extraction (IE) systems identify
structured information (e.g., people names, relations between companies,
people, locations, etc.) and, not surprisingly, IE systems are
becoming a critical first-class operator in a large number of text- 
processing frameworks. As a concrete example, search engines are 
moving beyond a "keyword in, document out" paradigm to provid- 
ing structured information relevant to users' queries (e.g., provid- 
ing contact information for businesses when user queries involve 
business names). For this, search engines typically rely on hav- 
ing available large repositories of structured information generated 
from web pages or query logs using IE systems. With the increas- 
ing complexity of IE pipelines, a critical exercise for IE developers 
and even users is to debug, i.e., perform a thorough post-mortem 
analysis of the output generated by running an entire or partial ex- 
traction pipeline. Despite the popularity of IE pipelines, very little 
attention has been given to building effective ways to trace the con- 
trol or data flow through an extraction pipeline. 

Example 1.1. Consider an IE pipeline for extracting contact 
information for businesses, namely, business name, address (one 
or many), phone number (one or many), from a set of web pages. 
The pipeline, in addition to others, consists of operators (a) to clean 
and parse html web pages, (b) to classify 'blocks' of text in a web 
page as being useful or not for this task, (c) extract business names, 
(d) extract address(es). (We discuss this real-world pipeline in detail
later in Section 2.) Two interesting points to note here: First, in
practice, such complex pipelines may be put together using off-the-shelf
operators (e.g., html parsers or segmenters) along with some
newly designed as well as some re-usable operators from other systems.
Second, IE is an error-prone process and oftentimes, output
from an IE pipeline may miss some information (e.g., a record 
where contact information is present but business name is absent) 
or may generate unexpected output (e.g., associate a fax number 
with a business instead of phone number). 

Say a user of this IE pipeline processes a batch of web pages and 
generates a set of (partial, complete, or incorrect) output records. 
Given the output, the user may be interested in understanding why 
certain incorrect records were generated to identify and eliminate 
their 'sources'; similarly, the user may also be interested in under- 
standing why certain records were missing attributes in the output 
to identify the 'restrictive' operators in the pipeline. □ 

To date, there have been two main approaches for understanding
the output from an IE system, but neither fully addresses the problem of
debugging arbitrary IE pipelines. The first approach is to build statistical
models to predict the output quality of an IE system [6, 9].
However, these models address the more modest goal of assessing
the overall output quality and lack the intuitive interaction necessary
for building debuggers to trace the generation of an output
record. The second approach involves using complete knowledge 
of how each operator functions. As highlighted by the above exam- 
ple, prior information regarding the specifications of the operators 
may not be available (e.g., off-the-shelf black-box operators). In 
the absence of full function specifications of an operator, the only 
(straightforward) approach to debugging is exploring all data in the 
pipeline. However, such an approach is clearly infeasible due to the 
sheer volume of data. For instance, debugging a simple pipeline involving
10 operators with 10,000 input records per operator would
require 100K records to be manually examined. (As we shall see in
our experiments in Section 6, typical data sizes are even larger.)

This paper presents PROBER (for Provenance-Based Debugger), 
the first generic framework for debugging information extraction 
pipelines composed of arbitrary ("black-box") operators. A criti- 
cal task towards building debuggers is that of tracing and linking 
output records from each operator and understanding their trans- 
formations across different operators in the pipeline. To trace the 
lineage of any arbitrary record in the output, we propose a novel 
provenance model for IE pipelines. With debugging in mind, our 
provenance model tries to minimize the amount of user effort necessary
to resolve the fate of the records in the output. For example,
the provenance of an (incorrect) output record refers only to the input
tuples that impacted this output record. We present a suite of algorithms
to build the provenance for an IE run; our algorithms explore
the tradeoff between the efficiency of building provenance and
the amount of information captured by it.

As outlined by Example 1.1, exact functional specifications for
operators in an IE pipeline may not be available. However, provenance
building can exploit various properties of these operators
that can be learned by sampling, namely, monotonic, one-to-one,
one-to-many, or arbitrary. We characterize a diverse set of operators,
and their properties, that are found in real-world extraction
pipelines and rigorously examine methods to build provenance information
for each of these combinations.

Figure 1: Example of an IE pipeline to generate business names
and their contact information.

In summary, beyond the conceptualization of PROBER (Section 2),
the main contributions of the paper are as follows:

• A novel provenance model for the task of debugging information
extraction pipelines. Our model effectively accounts
for scenarios where incomplete (or no) knowledge about the
underlying operators in the pipeline is available (Section 3).

• A suite of effective algorithms to build provenance, given an
IE run (Section 4).

• An end-to-end solution combining extraction output along
with provenance information for debugging (Section 5).

• An extensive evaluation over real-world datasets, demonstrating
the effectiveness of our framework in debugging extraction
pipelines (Section 6).

2. PROBLEM FORMULATION 

While IE pipelines may vary in their implementation logic [1, 10, 12, 13],
several underlying common components can be abstracted
from the implementation details. We characterize information extraction
pipelines for the task of performing post-mortem analysis.

Definition 2.1. [Record] A record $r$ is a basic unit of data
(e.g., a tuple), consisting of a globally unique identifier $I(r)$ and a
value $V(r)$. We use $R$ to denote the set of all records. □

Definition 2.2. [Operator] An operator is defined by a
function $O : (I_1, I_2, \ldots, I_N) \to R$, where each $I_i \subseteq R$ is a
set of records. In practice, the function $O$ may be unknown to us. □

Intuitively, an operator takes as input an $N$-tuple of sets of records
and outputs one set of records. Specifications on how an opera- 
tor generates an output record may be available in varying forms. 
Specifically, we consider the following four scenarios involving op- 
erator specifications. 



An operator is said to be a black-box if we have no informa- 
tion about it. In this case, naturally, the only way to gain infor- 
mation about a black-box operator is by executing it on input sets 
of records. In contrast, we have exact information about an op- 
erator O if we know precisely which input records contributed to 
each output record, and how. We have Input-Output (IO) specifications
when for each output record, we know which input records
were used to construct it; however, exactly how a record is generated
is unknown to us. Finally, we may have integrity constraints,
e.g., key-foreign key relationships, satisfied by the input and output
records. For instance, an operator may support a 'debug' mode
where each output record is assigned an id associated with the input
records that generated it. Effectively, using key-foreign keys we
have the same information as that in IO specifications, but this information
is (indirectly) available using dependencies on the values
of fields in input and output records.

Next, we define various (standard) properties of an operator that
help design specialized algorithms for building provenance and debugging
effectively. As we will see later, these properties may be
learned by sampling or from the operator specifications (when available)
described above.

Definition 2.3. [Properties]

• monotonic: Operator $O$ is monotonic iff $\forall I_1, I_2 \subseteq R$ : $(I_1 \subseteq I_2) \Rightarrow (O(I_1) \subseteq O(I_2))$.

• one-to-one: Operator $O$ is one-to-one iff: (a) $\forall I \subseteq R$ : $O(I) = \bigcup_{r \in I} O(\{r\})$; (b) $\forall r \in R$ : $|O(\{r\})| \le 1$.

• one-to-many: Operator $O$ is one-to-many iff $\forall I \subseteq R$ : $O(I) = \bigcup_{r \in I} O(\{r\})$.

• many-to-one: Operator $O$ is many-to-one iff $\forall I \subseteq R$, $\exists$ a partition $P_I = \{I_1, \ldots, I_n\}$ of $I$ such that: (a) $O(I) = \bigcup_{i=1}^{n} O(I_i)$; and (b) $\forall i : |O(I_i)| \le 1$. □
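
Because Definition 2.3 states each property as a condition on input sets, a black-box operator can be probed by running it on sampled inputs. The following is a minimal sketch of that idea and not the paper's procedure: operators are modeled as Python callables from a set of records to a set of records, and all function names, the toy operators, and the sampling scheme are assumptions made for illustration. Sampling can refute a property but never prove it, so a passed check is only evidence.

```python
import random

def sample_subset(items, rng):
    """Pick each record independently with probability 1/2."""
    return {x for x in items if rng.random() < 0.5}

def looks_monotonic(op, universe, trials=200, seed=0):
    """Search for a counterexample to: I1 <= I2 implies O(I1) <= O(I2)."""
    rng = random.Random(seed)
    items = sorted(universe)
    for _ in range(trials):
        i2 = sample_subset(items, rng)
        i1 = sample_subset(sorted(i2), rng)  # nested sample, so i1 <= i2
        if not op(i1) <= op(i2):
            return False
    return True

def looks_one_to_many(op, universe, trials=200, seed=0):
    """Search for a counterexample to: O(I) equals the union of O({r})."""
    rng = random.Random(seed)
    items = sorted(universe)
    for _ in range(trials):
        i = sample_subset(items, rng)
        per_record = set().union(*(op({r}) for r in i))
        if op(i) != per_record:
            return False
    return True

def looks_one_to_one(op, universe, trials=200, seed=0):
    """One-to-many decomposition plus at most one output per input record."""
    at_most_one = all(len(op({r})) <= 1 for r in universe)
    return at_most_one and looks_one_to_many(op, universe, trials, seed)

# Toy operators (hypothetical, for illustration only).
def split_words(records):       # one-to-many
    return {w for s in records for w in s.split()}

def upper_case(records):        # one-to-one
    return {s.upper() for s in records}

def needs_two(records):         # monotonic, but not decomposable per record
    return {"ok"} if len(records) >= 2 else set()

def only_when_empty(records):   # non-monotonic
    return {"empty"} if not records else set()

PAGES = {"acme corp", "main st", "555 0100", "fax"}
```

In this sketch a many-to-one check is omitted, since it would require searching over partitions of the input.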

Definition 2.4. [Extraction Pipeline] An extraction pipeline
$P$ is defined by a DAG $G(V, E)$ consisting of a set $V$ of nodes and
a set $E$ of edges, where each node $v \in V$ corresponds to an operator
$O$ in the pipeline. An edge $a \to b$ between nodes $a$ and $b$
indicates that the output from the operator represented by $a$ is input
to the operator represented by $b$. We have a single special node $s \in V$
with no incoming edges representing the operator that takes input
to the pipeline, and one special node $t$ with no outgoing edges
representing the operator that outputs the final set of records. □
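
The structural conditions of Definition 2.4 (acyclic, one source $s$, one sink $t$) can be checked mechanically. The sketch below uses Python's standard `graphlib` module; the function name `validate_pipeline` and the edge list are assumptions for illustration. In particular, the wiring among the operators is a guess made for the example and is not taken from Figure 1.

```python
from graphlib import TopologicalSorter

def validate_pipeline(edges):
    """Check Definition 2.4 on a list of (a, b) edges, where a feeds b.

    Returns (source, sink, topological order). Raises ValueError if there is
    not exactly one source and one sink; graphlib raises CycleError on a cycle.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    preds = {n: set() for n in nodes}
    succs = {n: set() for n in nodes}
    for a, b in edges:
        preds[b].add(a)
        succs[a].add(b)
    sources = [n for n in nodes if not preds[n]]
    sinks = [n for n in nodes if not succs[n]]
    if len(sources) != 1 or len(sinks) != 1:
        raise ValueError("pipeline needs exactly one source and one sink")
    order = list(TopologicalSorter(preds).static_order())
    return sources[0], sinks[0], order

# Hypothetical wiring of the operators named in the motivating example.
EDGES = [
    ("wb", "sg"),
    ("sg", "ad"), ("sg", "pn"), ("sg", "nm"),
    ("ad", "jn"), ("pn", "jn"), ("nm", "jn"),
    ("jn", "dp"), ("dp", "sc"),
]
```

A topological order is also the order in which per-operator provenance could be built from the input towards the output.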

We now discuss a real-world extraction pipeline (used at Yahoo!),
which forms the basis of our illustrations in this paper. 

Motivating Example 

Figure 1 shows a real-world extraction pipeline, Business, for
building a large collection of businesses (see Example 1.1) by extracting
records of the form $(n, a, p)$, where business $n$ is located
at address $a$ with contact number $p$. The first step is to build a set of
web pages likely to contain information regarding businesses, which
is done using a variety of document retrieval strategies. Specifically,
we issue manually generated domain-dependent queries (e.g.,
"Toyota car dealership locations") as well as use form filling methods
where entries such as model, make, and zipcode may be filled
in order to fetch a list of car dealerships. This operator, denoted by
wb, is an example of a black-box operator with arbitrary properties.

Given a collection of web pages, operator sg parses the html
page and identifies appropriate segments of text in this page, where
ideally, each segment contains a complete target record (see Figure 4
in the appendix for a real-world example). These segments
are then processed by operators ad and pn, which respectively
identify an occurrence of an address and a phone number. The
annotation from one operator is used by the subsequent operator to
identify regions of text that should not be processed. ad and pn are
implemented using hand-crafted patterns based on a dictionary of
address formats. The nm operator, on the other hand, needs to identify
names of businesses, which may be arbitrary strings, and for this
we follow a wrapper-induction approach. In particular, using some
training examples we learn a wrapper rule to identify candidate
business names; these rules are based on the document structure
of the html content. Of course, several other implementations for
each of these operators are possible and the implementation details
are orthogonal to our discussion since our goal is to build debuggers
for pipelines with black-box operators where no implementation
information may be available. The jn operator joins output
from ad, pn, and nm to build candidate output records which are,
in turn, processed by dp to eliminate duplicates. The final operator,
sc, assigns a confidence score to each output record.

Footnote (Definition 2.3): a partition $P_I = \{I_1, \ldots, I_n\}$ of $I$ satisfies (a) $I = \bigcup_{i=1}^{n} I_i$; and (b) $\forall i \ne j : (I_i \cap I_j) = \emptyset$.

We note that all our implementations of the above operators
are monotonic. (Obviously, there may be non-monotonic implementations
in other pipelines, but we primarily consider monotonic
operators in this paper.) Although monotonic, the operators
from the pipeline span a variety of properties, e.g., segmentation
is a one-to-many operation, and by design one address is extracted
from each segment, so address extraction is one-to-one,
while de-duplication is many-to-one. Candidate webpage generation
and wrapper training, on the other hand, are arbitrary, i.e.,
"many-to-many".

Given unexpected output records, an IE developer may want to 
answer some natural questions about the output. (Figure 4 in the
appendix shows an example where sg generates an incorrect segment
that leads to missing one address and extracting one incorrect
address.) Specifically, a developer may be interested in tracing
all or part of the input records that contributed to a particular out- 
put record. For instance, given an incorrectly extracted record, we 
would like to know only the relevant subset of webpages and train- 
ing data that impacted it, i.e., the minimal amount of input data 
necessary to identify the error. Motivated by the above observa- 
tions, we focus on the following problem in this paper. 

Problem 2.1. Given a pipeline $P$, input $I$, and partial information
about operators in $P$, we would like to (1) build provenance
for the set of (intermediate and final) records in the pipeline; (2) ex- 
pose provenance to developers through a query language and guide 
them in debugging the pipeline. 

3. PROVENANCE FOR IE PIPELINES 

The notion of provenance is relatively well-understood for traditional
relational databases (see [5, 11]). A commonly advocated
model [2] is to use a boolean-formula provenance, e.g.,
$S_1 \wedge (S_2 \vee S_3)$. For the purpose of debugging extraction pipelines,
such provenance models are not appropriate for two main reasons.
First, unlike relational queries where the exact specifications of
each operator are known, we may have black-box operators in our
extraction pipeline. Second, for debugging, ideally we would like
to limit the number of records (and simplify their interdependencies,
typically represented as boolean formulas for relational operators) a
human has to assess in order to understand the issue at hand. With
these in mind, we define a provenance model based on minimal
subsets of operator inputs that capture necessary information (Section 3.1)
and extend this basic model to operators where multiple
minimal subsets may exist (Section 3.2).

3.1 MISet: Basic Unit of Provenance 



To define the provenance of an IE pipeline $P$, we begin by defining
the provenance for each operator in $P$; the subsequent sections
show how to construct the provenance for each operator in $P$ (Section 4)
and for $P$ by composing individual operators' provenance
(Section 5). We primarily confine ourselves to extraction pipelines
consisting of only monotonic operators (see Definition 2.3), which
are a common case in practice (as in our motivating example from
Section 2); extensions to non-monotonic operators are very briefly
discussed in the appendix (Section C), but largely left as future
work.

We define the provenance of an extraction operator $O$ based on
the provenance for each output record $r \in R$ for $O$. Ideally, we
would like the provenance of $r$ to represent precisely the set of
records that contributed to $r$; however, as we will see, in practice
it may not always be possible to determine the precise set of contributing
records (e.g., in the absence of exact information about
$O$), and even if possible it may be computationally intractable. For
our goal of building a debugger, we observe that one of the main operations
we expect users to perform is to look at an (erroneous) output
record $r$ and explore its provenance to determine the cause of this
error. Therefore, a suitable provenance model is one that enables
users to examine the fewest records required to decide the fate of
an output record $r$. Formally, we define a basic unit of provenance,
called MISet, as follows:

Definition 3.1. [MISet] Given an operator $O$, its input $I$
and output $R$, we say that $I_s \subseteq I$ is a Minimal Subset (MISet) of
$r \in R$ if and only if: (1) $r \in O(I_s)$; and (2) $\forall I' \subset I_s : r \notin O(I')$.
We use $M_{all}(O, I, r \in R)$ to denote the set of all MISets of $O$ for
input $I$ and output record $r \in R$. □

Intuitively, an MISet gives the fewest input records required for 
a particular output record r to be present. Therefore, an MISet 
provides users with one possible reason for the occurrence of r. 
This, in turn, reduces the burden of manual annotation on the users; 
in the absence of MISets, a user may have to explore the entire 
input to understand what caused an error in the output. The notion 
of MISets primarily focuses on debugging the presence of records 
in the output; in Appendix C, we briefly discuss a corresponding
notion (MASets) for the case of non-existence of records in the 
output, but leave details for future work. 

In practice, we may have more than one MISet possible for an 
output record as shown by the following example. 

Example 3.1. Consider a (simple) record validation operator
(e.g., sc in Figure 1) that computes the "support" of each record
and outputs only records with sufficient support. Suppose sc outputs
a record $r$ if there are at least 50 input records supporting it.
Given an input of 100 records that could support $r$, any subset of
exactly 50 of these input records is an MISet for $r$. □

When multiple MISets are available, several ways of building
provenance are possible, each differing in the extent to which they
impart information and in execution speed, as explored next.
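
To make Example 3.1 concrete, the sketch below checks Definition 3.1 for a threshold operator. It is an illustration under stated assumptions: `is_miset` and `support_filter` are hypothetical names, and records are modeled as integers. For a monotonic operator, condition (2) only needs to be tested on subsets that drop a single record, because any smaller subset that still produced $r$ would force one of those single-removal subsets to produce $r$ as well.

```python
from math import comb

def is_miset(op, subset, r):
    """Definition 3.1 for a monotonic operator `op`.

    (1) r is produced from `subset`;
    (2) removing any single record from `subset` loses r.
    """
    subset = frozenset(subset)
    if r not in op(subset):
        return False
    return all(r not in op(subset - {m}) for m in subset)

def support_filter(threshold, record="r"):
    """Toy validation operator: emit `record` once enough inputs support it."""
    def op(records):
        return {record} if len(records) >= threshold else set()
    return op

sc = support_filter(50)

# Every 50-record subset of the 100 supporting inputs is an MISet, so the
# number of MISets is the binomial coefficient C(100, 50).
number_of_misets = comb(100, 50)
```

The count `comb(100, 50)` exceeds $10^{29}$, which is why enumerating all MISets is not always an option and why the summaries introduced next are useful.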

3.2 Handling Multiple MISets 

Several formalisms for a provenance model are possible when
multiple MISets are available. We rigorously examine compositions
of MISets, capturing the spectrum from complete (and
potentially intractable) provenance to more tractable (but approximate)
provenance. Later, in Section 4, we will present algorithms
for building each type of provenance.

Consider an operator $O$ which consumes input $I$ and generates
output $R$; for a record $r \in R$, we denote the provenance of $r$
as $P(O, I, r)$. We use subscripts $P_*$ to capture various types of
provenance and, when clear from the context, we simply use $P_*(r)$
to denote $P_*(O, I, r)$.

Require: $O$, $I$, $r \in O(I)$
Find an MISet $M$ (Algorithm 2)
for $m \in M$ do
    if $r \in O(I - \{m\})$ then
        RETURN "Non-unique"
    end if
end for
RETURN "Unique"
Algorithm 1: Testing the uniqueness of MISets for any monotonic operator.

specific cases (e.g., exact or I/O specifications); however, whenever
necessary we will point out the differences. Table 1 summarizes
the results on provenance inference achieved in this section, with
details in the following subsections. All time complexities in the
table are given in terms of the size of the input $N$ and the size of
the output $M$. The table gives time complexity assuming an operator
can be executed freely (i.e., in $O(1)$); depending on the running
time of the operator, the appropriate factor can be multiplied.

4.1 Unique MISet Operator 

We begin with the case when a combination of input $I$, output
$R$, and operator $O$ has a unique MISet for each output record.
Note that we don't assume that we know the uniqueness of MISets;
instead, we only need the existence of a unique MISet. (That
is, our results hold even in the case when we don't have any information
about the black-box operator, but it just happens that the
operator functions in a way that creates a unique MISet for output
records.) We have the following main result for unique MISet operators.
(Complete proofs for all results in the paper are presented
in Appendix A.)

Theorem 4.1 (Compute MISet). Given any monotonic
operator $O$, input $I$, and output $R$, if $O$ has a unique MISet for
each output record $r \in R$, then the unique MISet for $r$ can be
constructed in $O(N)$.

As a consequence, all entries in Table 1 for unique MISet operators
can be solved in $O(N)$. The above result follows from: (a) the
following lemma that tests for uniqueness of MISets; and (b) the
fact that a single MISet for any monotonic operator can be computed
efficiently using Algorithm 2 (Lemma 4.2 presents a more
general result for $k$ MISets computation shortly).

Lemma 4.1. [Uniqueness Test] Given any monotonic operator
$O$, input $I$, output $R$, and any $r \in R$, Algorithm 1 tests in $O(N)$
whether there is a unique MISet for $r$.

The algorithm works in two stages. First, it finds any MISet M for 
r. Second, it attempts to find other sets that might produce r by 
applying O on the entire input except one record from M. If none 
of these sets produce r, there is no other MISet. 
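
The two stages described above can be sketched in Python as follows. This is an illustrative rendering of Algorithms 1 and 2 under the assumption that an operator is a callable from a set of records to a set of records; the function names and the two toy operators are hypothetical. Each function executes the operator at most $N + 1$ times.

```python
def find_miset(op, inputs, r):
    """Algorithm 2: greedily drop records while r is still produced."""
    current = set(inputs)
    if r not in op(current):
        raise ValueError("r is not produced from the given input")
    for m in list(current):
        if r in op(current - {m}):
            current.remove(m)
    return current

def has_unique_miset(op, inputs, r):
    """Algorithm 1: the MISet is unique iff every one of its records is
    indispensable with respect to the *entire* input."""
    inputs = set(inputs)
    miset = find_miset(op, inputs, r)
    return all(r not in op(inputs - {m}) for m in miset)

# Toy monotonic operators (hypothetical, for illustration only).
def needs_both(records):
    """r requires both a name and a phone: one MISet, {name, phone}."""
    return {"r"} if {"name", "phone"} <= set(records) else set()

def needs_either(records):
    """r requires a phone or a fax: two MISets, {phone} and {fax}."""
    return {"r"} if {"phone", "fax"} & set(records) else set()

INPUT = {"name", "phone", "fax"}
```

The greedy loop relies on monotonicity: once a record has been found indispensable for the current set, it stays indispensable for every smaller set.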

4.2 One-to-One and One-to-Many Operators 

For the relatively simple cases of one-to-one and
one-to-many operators, we can easily obtain our composite
provenances in time linear in the size of the input $N$
for one-to-one operators and linear in $N + M$ for
one-to-many operators, as shown by the following theorem.

Theorem 4.2. Given a one-to-one operator $O$, input $I$ of
size $N$, and an output record $r \in R$, each of $P_{all}(r)$, $P_{any}(r)$,
$P_{uni}(r)$, $P_{int}(r)$, and $P_{imp}(r)$ can be computed in $O(N)$. For a
one-to-many operator, the complexity is increased to $O(N + M)$,
where $M = |O(I)|$.

4.3 Many-to-one and Arbitrary Operators 



Table 1: Complexity of our algorithms for obtaining $P_{all}$, $P_{any}$, $P_{uni}$, $P_{int}$,
and $P_{imp}$ for various properties of an operator $O$ with input $I$ and output $O(I)$
($N = |I|$ and $M = |O(I)|$). $k$ denotes the number of MISets that need to be found,
and $a$ is the size of the largest MISet ($a \le N$). * denotes cases where complexity
becomes PTIME when restricted to bounded-size MISets.

All- and Any-provenance: Ideally for any output record, we 
would like to provide all possible information using MISets, i.e., 
capture all possible "causes" of an output record. 

Definition 3.2. [All-provenance] Given an operator $O$, input
$I$, output $R$, and $r \in R$, we define all-provenance as
$P_{all}(r) = M_{all}(O, I, r \in R)$. □

In many cases $P_{all}$ may be intractable to compute or store, and
we may need to resort to "approximations" of it, presented shortly.
Alternatively, we may want to find any one (or $k$) MISets.

Definition 3.3. [Any-provenance] Given integer $k > 0$, operator
$O$, input $I$, output $R$, and $r \in R$, we define any-provenance
as any $P_{any}(r) \subseteq P_{all}(r)$ of size $\min(k, |P_{all}(r)|)$. □

Impact-provenance: Given a restricted amount of editorial resources,
we may want to explore the most impactful, i.e., top-$l$ input
records sorted by their impact instead of any-$k$ MISets. Our next
definition of provenance ranks tuples based on their expected impact
on the output record, measured by the number of MISets in
which a tuple is present.

Definition 3.4. [Impact-provenance] Given an operator $O$,
input $I$, output $R$, and $r \in R$, we define impact-provenance as
$P_{imp}(r) = \{(i, c_i) \mid \exists M \in P_{all}(r), i \in M, c_i = \sum_{M \in P_{all}(r), i \in M} 1\}$. □

Union- and Intersection-provenance: Our next goal is to summarize
$P_{all}$ using two approximations: (1) We obtain an "upper
bound" provenance that captures the set of all inputs that can
contribute to $r$, instead of exact combinations of inputs. Therefore, we
define the union-provenance of $r$ to be the union of all its MISets.
(2) We obtain a "lower bound" provenance that captures the set
of all inputs necessary for $r$; we define the intersection-provenance
of $r$ to be the common input records among all MISets.

Definition 3.5. [Union-Provenance] Given an operator $O$,
input $I$, output $R$, and $r \in R$, we define union-provenance as
$P_{uni}(r) = \bigcup_{I_s \in M_{all}(O, I, r \in R)} I_s$. □

Definition 3.6. [Intersection-Provenance] Given an operator
$O$, input $I$, output $R$, and $r \in R$, we define intersection-provenance
as $P_{int}(r) = \bigcap_{I_s \in M_{all}(O, I, r \in R)} I_s$. □

It can be seen easily that for operators with unique MISets, $P_{uni}$
and $P_{int}$ coincide.
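
When $P_{all}(r)$ is available as an explicit collection of MISets, Definitions 3.4 to 3.6 translate directly into set operations. The sketch below assumes $P_{all}(r)$ is given as a list of Python sets; the function names and the sample data are hypothetical.

```python
from collections import Counter

def union_provenance(all_misets):
    """P_uni: every input record that occurs in at least one MISet."""
    return set().union(*all_misets)

def intersection_provenance(all_misets):
    """P_int: the input records common to all MISets."""
    sets = [set(m) for m in all_misets]
    return set.intersection(*sets) if sets else set()

def impact_provenance(all_misets):
    """P_imp: for each input record, the number of MISets containing it."""
    return Counter(i for m in all_misets for i in m)

# Hypothetical P_all(r) with two MISets sharing one record.
P_ALL = [{"seg1", "seg2"}, {"seg1", "seg3"}]
```

Calling `impact_provenance(P_ALL).most_common(l)` would give the top-$l$ records by impact; with a single MISet, the union and the intersection are the same set.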

4. INFERRING PROVENANCE 

We now turn to the critical task of deriving the provenance
formalisms proposed in Section 3.2. We primarily consider the
generic black-box operator case while rigorously examining various
properties described in Section 2, and our results carry over to more



Require: $O$, $I$, $r \in O(I)$
Set $M = I$
for $m \in M$ do
    if $r \in O(M - \{m\})$ then
        $M = M - \{m\}$
    end if
end for
RETURN $M$

Algorithm 2: Computing a single MISet for any monotonic operator.



Require: $O$, $I$, $r \in O(I)$, $p$ MISets $\{M_1, \ldots, M_p\}$
for $(m_1, \ldots, m_p) \in M_1 \times \ldots \times M_p$ do
    Set $I' = I - \{m_1, \ldots, m_p\}$
    if $r \in O(I')$ then
        RETURN the result of Algorithm 2 using $O$, $I'$, $r$ as input
    end if
end for
RETURN "No other MISet"



Require: $O$, $I$, $r \in O(I)$
Set $S = \emptyset$
for $i \in I$ do
    if $r \notin O(I - \{i\})$ then
        $S = S \cup \{i\}$
    end if
end for
RETURN $S$

Algorithm 4: Computing $P_{int}(r)$ for any monotonic operator $O$.
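
Algorithm 4 can be written compactly in Python. The sketch assumes, as before, that the operator is a callable on sets of records; the function name and the toy operator `contact_record` are hypothetical. A record belongs to $P_{int}(r)$ exactly when removing it alone from the full input makes $r$ disappear.

```python
def intersection_provenance(op, inputs, r):
    """Algorithm 4: collect the records whose removal alone loses r."""
    inputs = set(inputs)
    if r not in op(inputs):
        raise ValueError("r is not produced from the given input")
    return {i for i in inputs if r not in op(inputs - {i})}

def contact_record(records):
    """Toy monotonic operator: r needs a name and at least one number."""
    records = set(records)
    has_number = bool({"phone", "fax"} & records)
    return {"r"} if "name" in records and has_number else set()

INPUT = {"name", "phone", "fax", "addr"}
```

For this toy operator the MISets are {name, phone} and {name, fax}, so only `name` is necessary; the loop issues one operator execution per input record.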

Computing $P_{imp}$: Finally, we show that the hardness result for $P_{all}$
can be extended easily to $P_{imp}$.

Corollary 4.1. Given any arbitrary or many-to-one
monotonic operator $O$, input $I$ of size $N$, output $R$ of size $M$, and
an output $r \in R$, it is #P-complete in $N$ and $M$ to compute $P_{imp}$.

4.3.1 Bounded-size MISets

Given the intractability of the general problem for $P_{all}$ and $P_{imp}$
for many-to-one and arbitrary monotonic operators, we explore
an intuitive tractable subclass. We consider a practical special
case of all MISets being of small size (i.e., bounded by a constant).
The following theorem shows that we can now infer all types of $P$
in polynomial time, using an explicit search.

Theorem 4.5. Given any monotonic operator $O$, input $I$, and
output $r \in R$, we can find each of $P_{all}$, $P_{any}$, $P_{uni}$, $P_{int}$, and $P_{imp}$
for $r$ in polynomial time, when for every $S \in M_{all}(r)$, we have $|S| \le B$.
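
One way to realize an explicit search under the size bound is sketched below. This is an assumption about how such a search could look, not the construction from the paper's appendix; `all_misets_bounded` and the toy operator are hypothetical names. Candidates are enumerated by increasing size, so for a monotonic operator a candidate that produces $r$ is minimal exactly when it contains no previously found MISet. The sketch executes the operator on at most $\sum_{s \le B} \binom{N}{s}$ subsets.

```python
from itertools import combinations

def all_misets_bounded(op, inputs, r, bound):
    """Enumerate every MISet of r with at most `bound` records.

    If all MISets satisfy the bound, the result is P_all(r).
    """
    items = list(inputs)
    found = []
    for size in range(bound + 1):
        for candidate in combinations(items, size):
            s = frozenset(candidate)
            # A strict superset of a known MISet cannot be minimal.
            if any(m < s for m in found):
                continue
            if r in op(s):
                found.append(s)
    return found

def contact_record(records):
    """Toy monotonic operator: r needs a name and at least one number."""
    records = set(records)
    has_number = bool({"phone", "fax"} & records)
    return {"r"} if "name" in records and has_number else set()

INPUT = {"name", "phone", "fax", "addr"}
```

Once $P_{all}(r)$ is enumerated this way, the union, intersection, and impact summaries follow by plain set operations over the returned list.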

5. PUTTING IT ALL TOGETHER 
5.1 Composing Operator Provenance 

So far, we focused only on computing provenance for a single 
operator. We now consider composition of provenance from single 
operators into provenance for a chain of operators. Our goal is 
to understand to what extent (if at all) we can use each individual
operator's provenance to determine the provenance of a pipeline.
Formally, we would like to solve the following problem. 

Problem 5.1. Given monotonic operators $O_1, O_2$, input $I_1$,
outputs $R_1 = O_1(I_1)$ and $R_2 = O_2(R_1)$, and $r_2 \in R_2$, can we
compute $P_*(O_2 \circ O_1, I_1, r_2 \in R_2)$ from $P_*(O_2, R_1, r_2 \in R_2)$
and $P_*(O_1, I_1, r_1 \in R_1)$?

Intuitively, we are interested in generating all provenance $P_*$ for the
composition operator $O_{12} = (O_2 \circ O_1)$ from all the provenance
$P_*$ of each of $O_1$ and $O_2$. Before proceeding to solve the above
problem, we make two observations. First, note that Problem 5.1
could have been equivalently defined if $O_1$ and $O_2$ were themselves
chains of operations with $P_*$ being the provenance of these chains.
Our algorithms in Section 4, and hence results in this section, make
no assumption on $O_1$ and $O_2$ being single operators, so all our
results carry over when they are chains of operators. Second, our
goal is to explore to what extent the provenance of $O_1$ and $O_2$ can
be reused to generate the provenance of $O_{12}$, without any additional
execution of $O_1$ or $O_2$.

Our first main result shows that the execution of $O_2 \circ O_1$ can
be completely simulated using $P_{all}$ for $O_1$ and $O_2$, and hence all
provenance of $O_2 \circ O_1$ can be computed as in Section 4. The theorem
gives a constructive algorithm, and hinges on the core idea
that $P_{all}$ for any monotonic operator captures enough information
to execute the operator on any subset of its input.

Theorem 5.1. Given monotonic operators $O_1, O_2$, input $I_1$,
outputs $R_1 = O_1(I_1)$ and $R_2 = O_2(R_1)$, and $r_2 \in R_2$, for
any $I_s \subseteq I_1$, we have $r_2 \in (O_2 \circ O_1)(I_s)$ if and only if



Algorithm 3: Algorithm for finding an MISet different from a given set of $p$ MISets.
The algorithm can be applied multiple times to generate several distinct MISets.
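
Algorithm 3 and its repeated application can be sketched in Python as below. The sketch assumes operators are callables on sets of records; the names `find_miset`, `find_another_miset`, `any_provenance`, and the toy operator are hypothetical. Removing one record from each known MISet guarantees that whatever MISet is found in the reduced input differs from all known ones.

```python
from itertools import product

def find_miset(op, inputs, r):
    """Algorithm 2: greedily drop records while r is still produced."""
    current = set(inputs)
    if r not in op(current):
        raise ValueError("r is not produced from the given input")
    for m in list(current):
        if r in op(current - {m}):
            current.remove(m)
    return current

def find_another_miset(op, inputs, r, known):
    """Algorithm 3: return an MISet not in `known`, or None if none exists."""
    inputs = set(inputs)
    for picks in product(*known):
        reduced = inputs - set(picks)  # drop one record of every known MISet
        if r in op(reduced):
            return find_miset(op, reduced, r)
    return None

def any_provenance(op, inputs, r, k):
    """Collect up to k distinct MISets by repeated calls to Algorithm 3."""
    found = [find_miset(op, inputs, r)]
    while len(found) < k:
        extra = find_another_miset(op, inputs, r, found)
        if extra is None:
            break
        found.append(extra)
    return found

def contact_record(records):
    """Toy monotonic operator: r needs a name and at least one number."""
    records = set(records)
    has_number = bool({"phone", "fax"} & records)
    return {"r"} if "name" in records and has_number else set()

INPUT = {"name", "phone", "fax", "addr"}
```

Running `any_provenance` with a large `k` until it stops also enumerates all MISets, at a cost that grows with the product of the MISet sizes.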



Computing $P_{any}$: We can always find an MISet using an $O(N)$
algorithm for any monotonic operator $O$. Algorithm 2 describes
how to find such an MISet. Algorithm 3 provides an extension that
finds $k > 1$ MISets (when $k$ MISets exist): For brevity, we specify
the algorithm to find a different $(p + 1)$th MISet given $p$ MISets.



The algorithm is invoked $(k - 1)$ times after Algorithm 2, successively
adding MISets to obtain $k$ distinct MISets. The following
lemma establishes our result.

Lemma 4.2. Given any monotonic operator $O$, input $I$, and
output record $r \in O(I)$, $P_{any}$ for $k$ MISets can be computed in

Note that the actual complexity is $O(N a^k)$, where $a$ is the size of
the largest MISet. Therefore, if all MISets are small, the algorithm 
runs very efficiently. 

Computing $P_{int}$: $P_{int}$ for any arbitrary operator can be computed
using an $O(N)$ algorithm. Algorithm 4 shows how to obtain $P_{int}$
and the theorem below establishes the result.

Theorem 4.3 ($P_{int}$ Computation). Given any monotonic
operator $O$, input $I$, and output record $r \in O(I)$, Algorithm 4
correctly computes $P_{int}(r)$ with $O(M + N)$ executions of $O$.

Computing $P_{uni}$: To compute the $P_{uni}$ of an output record $r \in O(I)$,
for each input record $i \in I$, we need to determine whether
there exists any MISet $M$ containing $i$. We employ a simple approach
to determine this: we find all MISets (using Algorithms 2 and 3)
and determine their
union. Note that in the worst case, our naive algorithm takes exponential
time in $|I|$. Finding better upper or matching lower bounds
is an open problem.

Computing $P_{all}$: For $P_{all}$, we show that finding $P_{all}$ for a
many-to-one monotonic operator is #P-complete in the size of
the input; i.e., no polynomial-time algorithm
to compute $P_{all}$ exactly exists unless P = NP. Our result is proved using a reduction
from the problem of finding all minimal vertex covers in an undirected
graph.

Theorem 4.4. Given any arbitrary or many-to-one
monotonic operator $O$, input $I$ of size $N$, output $R$ of size $M$,
and an output $r \in R$, it is #P-complete in $N$ and $M$ to compute $P_{all}$.

Shortly, we show that the complexity can be made PTIME by re- 
stricting ourselves to MISets of bounded size. 

^#P-completeness corresponds to the class of hard counting prob- 
lems. 



Properties                        P*(O2 ∘ O1, ·)† 

O1: arbitrary,  O2: arbitrary     Pint^12(r2) ⊇ ∪_{r1 ∈ Pint^2(r2)} Pint^1(r1) 
O1: one-to-one, O2: arbitrary     Pall^12(r2) = { ∪_{s1 ∈ s2} Pall^1(s1) | s2 ∈ Pall^2(r2) } 
O1: arbitrary,  O2: one-to-one    P*^12(r2) = P*^1(P*^2(r2))‡ 

Table 2: Problem 5.1 for combinations of properties for O1 and O2 operating 
on inputs I1 and R1 = O1(I1) respectively, where R2 is the result of O2. The 
table uses P*^12(r2), P*^2(r2), and P*^1(r1) as shorthands for 
P*(O2 ∘ O1, I1, r2 ∈ R2), P*(O2, R1, r2 ∈ R2), and P*(O1, I1, r1 ∈ R1) 
respectively. † * stands for one of uni, int, and any. ‡ We have slightly 
abused notation to apply P*^1 to a singleton set, instead of the record in 
the set itself. 

∃M2 ∈ Pall(O2, R1, r2) such that M2 ⊆ Os, where Os = 
{r1 | ∃M ∈ Pall(O1, I1, r1) s.t. M ⊆ Is}. 

Given the above result, we know that Pall for O1 and O2 contain 
enough information to compute all P* for O2 ∘ O1. However, using 
Pall can be expensive because of the number of possible MISets. 
Hence, our next goal is to attempt to use other forms of provenance 
of O1 and O2, i.e., without enumerating Pall. If some provenance 
of O2 ∘ O1 cannot be computed directly, we can fall back on the 
techniques from Section 4 to generate the provenance, or use Pall 
via Theorem 5.1. 

Next we look at special cases of operators and determine when 
P* for O2 ∘ O1 can be computed efficiently. Table 2 summarizes 
our results. The table is complete in the sense that for any P* 
not present in the table, we must use Pall for O1 and O2 (using 
Theorem 5.1) to compute the entry, or resort to the techniques in 
Section 4. Since the possible combinations of operator properties 
are too many to list (16 combinations of arbitrary, many-to-one, 
one-to-one, and one-to-many), Table 2 presents a delineating 
subset of results. All other combinations of results can 
be derived from the entries in the table. For instance, when O2 
is one-to-one, P*^12 can be computed for arbitrary O1, hence 
other combinations involving O1 aren't present in the table. 
Similarly, we only consider arbitrary O2 when O1 is arbitrary. 
Further, we do not consider many-to-one separately, as the results 
for many-to-one and arbitrary are similar. Also, the results for 
one-to-one and one-to-many are similar, as they both ensure 
a unique singleton MISet for each output record. We don't 
separately consider unique MISets, as all of Pall, Pany, Puni, and 
Pint are the same and are computed easily. Finally, the table omits 
Pimp, as our solution for Pimp is equivalent to that of using Pall. 

5.2 Properties and Provenance Selection 

To summarize, given an IE execution, our approach, PROBER, 
allows users to specify the output records that they are interested in 
debugging. Using information such as IO specifications or 
integrity constraints, or using sampling, PROBER attempts to 
identify the type of the operators (e.g., one-to-one, or arbitrary). 
In the absence of any conclusive information, PROBER assumes 
an arbitrary operator. For each operator or pipeline, users may 
choose the type of provenance they want based on the editorial 
resources available. Note that users may always start with the 
conservative Pint or Pany, and explore more complex provenance, 
such as Pall, as needed, or ask for input to be ranked, such as Pimp. 

We make two important observations regarding this user 
exploration: (1) Whenever operators satisfy restricted properties 
(such as one-to-one), PROBER readily computes all forms of 
provenance very efficiently. (2) For arbitrary monotonic operators, 
all our algorithms proceed in a "pay-as-you-go" fashion; for 
instance, even if a user would like to perform an in-depth analysis 
of Pall, leading to a potentially expensive computation, PROBER 
starts returning MISets immediately and progressively provides 
more information as available. Specifically, our Pany algorithm 
keeps iteratively finding new MISets, which are returned to users 
as found. 

6. EXPERIMENTAL EVALUATION 

We now present results from our experimental evaluation. After 
describing our data sets (Section 6.1), we present a qualitative 
study of PROBER (Section 6.2). Next, we evaluate the 
effectiveness of our basic unit for provenance, namely, MISets 
(Section 6.3). Then, we perform a detailed study of various 
provenance formalisms (e.g., Puni, Pint, Pany, etc.) by discussing 
basic statistics (Section 6.4), and then compare their coverage 
(Section 6.5) and execution times (Section 6.6). 

6.1 Experimental Settings 

Data sources: We used a collection of 500 million web pages 
crawled by the Yahoo! search engine. 

Extraction pipelines: For our IE tasks, we implemented two 
pipelines. Our first pipeline, denoted Bussiness, is as described 
in Section 2. For our second pipeline, denoted Iterative, we 
reimplemented a state-of-the-art bootstrapping extraction technique 
described by Pasca et al. | ,12J for large-scale datasets such as Web 
corpora which is similar in spirit to other IE pipelines such as 
Snowball |1| and Espresso fTS]. 

Extracted relations: As extraction tasks, we focus on six relations 
(the last column shows the number of extracted tuples): 

1   footwear:    (name, address, phone)        340,131 
2   actors:      (movie, actor)                 14,414 
3   books:       (book, author)                142,337 
4   mayor:       (U.S. city, mayor)             28,514 
5   sen-party:   (senator, affiliated party)     2,119 
6   sen-state:   (senator, state)               14,582 


We built footwear using Bussiness on a corpus of 
5,443,183 web pages from 147 sites (see Section 2); all the other 
relations were built using Iterative. Our qualitative analysis, 
presented next, uses the footwear dataset. The empirical analysis 
that follows was performed on each of actors, books, mayor, 
sen-party, and sen-state. For most of our experiments we 
show results for the high-confidence tuples from our datasets, as 
these results are the most interesting: high-confidence tuples have 
the largest number of contributing input records and are therefore 
the hardest for provenance and debugging. 

6.2 Qualitative analysis of PROBER's utility 

To gain insights into the utility of PROBER, we performed 
a qualitative analysis of the records generated for footwear. 
Among the final set of output records, 38% were missing business 
names, 40% were missing phone numbers, and 37% were missing 
addresses. To give a flavour of user interaction with PROBER, we 
qualitatively depict a debugging analysis for a specific erroneous 
record. In particular, we explore a record, r, ('AUSTIN, TX', 
'Burnett St, Austin, Texas 78703', null), which has an incorrect 
value for business name and a missing value for phone number. 
Through the source web page associated with this record, we found 
that our first operator, namely sg, had incorrectly segmented the 
page. As shown in Figure 2, sg generated an incorrect segmentation 
for the second and third business contacts listed on the page. By 
fixing this segmentation, we debug and correct record r as well as 
other records extracted from this page. 

6.3 Is MISet an effective representation? 



JCPENNEY - WOMEN'S SHOE DEPT. 
151 UNIVERSITY OAKS 
UNIVERSITY OAKS S/C 
ROUND ROCK, TX 78664 
512-341-0764 
20 miles Map 

FITTING STOOL 
2438 W. ANDERSON LN 
AUSTIN, TX 78757 
21 miles Map 

SHOE BOXES BY THE FITTING STOOL 
AUSTIN, TX 78757 
21 miles Map 

KARAVEL SHOES 
5525 BURNETT RD #1 
AUSTIN, TX 78756 
21 miles Map 

InStep 
1010 W. 38th Street #150 
Austin, TX 78705 
(512) 476-5110 
22 miles Map 

WHOLE EARTH PROVISION 
2410 San Antonio Street 
AUSTIN, TX 78705 
(512) 478-1577 
22 miles Map 

Figure 2: Incorrect segmentation causing incorrect output. 

Earlier, in Section 3, we proposed MISets as the primary 
representation to collect information related to an output record for 
debugging purposes, and provided concrete theoretical justification 
for our choice. Of course, other representations are also possible in 
practice. Next, we present an experimental comparison of MISets 
against three strong baselines that could be used to collect tracing 
information for an output record. Specifically, we compare the 
following methods for generating tracing information. 

• All-recs: The naive baseline of repeatedly exploring all 
input records for every output record. 

• Wrd-OR: Using the bag of words in an output record, we 
build a keyword query to fetch all input documents contain- 
ing at least one term using a standard IR-like search interface. 

• Wrd-AND: Similar to Wrd-OR, except we only fetch docu- 
ments containing all terms in the output record. 

An important note about the Wrd-OR and Wrd-AND baselines is 
that they exploit specific information about the extraction opera- 
tors, namely, that input records aren't "mangled", i.e., terms are 
preserved by the extraction. MISets, on the other hand, use no such 
information. Since in our extraction scenario, we chose operators 
that do indeed preserve terms in records, our comparison is unfair 
in that it favors Wrd-OR and Wrd-AND. Our goal was to compare 
MISets with the best possible scenario for our baselines. (Clearly, 
in a fair comparison including operators that generate new terms or 
alter terms in input records, Wrd-OR and Wrd-AND won't even be 
applicable, and All-recs would be the only feasible baseline.) 
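
The two keyword baselines can be sketched as below. This is an illustration under simplifying assumptions: documents are held in an in-memory dictionary rather than behind an IR search interface, tokenization is plain lowercase whitespace splitting, and the names `wrd_or`, `wrd_and`, and the sample documents are invented for the example.

```python
def tokenize(text):
    """Lowercased bag of words (whitespace tokenization only)."""
    return set(text.lower().split())

def wrd_or(docs, output_record):
    """Documents containing at least one term of the output record."""
    terms = tokenize(" ".join(output_record))
    return [doc_id for doc_id, text in docs.items() if terms & tokenize(text)]

def wrd_and(docs, output_record):
    """Documents containing every term of the output record."""
    terms = tokenize(" ".join(output_record))
    return [doc_id for doc_id, text in docs.items() if terms <= tokenize(text)]

# Hypothetical input documents and one extracted (movie, actor) record.
DOCS = {
    "d1": "al pacino stars in heat",
    "d2": "heat wave hits texas",
    "d3": "stock prices fell",
}
RECORD = ("Heat", "Al Pacino")

or_hits = wrd_or(DOCS, RECORD)
and_hits = wrd_and(DOCS, RECORD)
```

Both functions rely on terms surviving extraction unchanged, which is exactly the extra knowledge about the operators that MISets do not need.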

Figure 3(a) presents our results comparing each method (MISets, 
Wrd-AND, Wrd-OR, All-recs) by examining the total number 
of input records that need to be fetched in order to generate the 
tracing information, varying the number of output records. By 
design, Wrd-AND retrieves the fewest possible documents, and 
MISets coincide completely with Wrd-AND. Indeed, this 
"experimentally proves" our claim from Section 3 that MISets 
retrieve minimal sets of records from the input. Note that even in 
our favorable setting for keyword-based retrieval, Wrd-OR retrieves 
many more input records (note the log scale on the y-axis), and 
All-recs is even more prohibitively expensive. 

6.4 Size of provenance formalisms 

Next we explore the size of provenance generated using each 
of our formalisms: Pall, Puni, Pint, and Pany with k = 1, 3, 5. 
(Pimp isn't shown, as the size of Pimp is naturally equal to the 
number of input records requested.) Figure 3(b) shows the average 
size of the provenance generated for each of the provenance 
formalisms over a set of 50 tuples ranked by their confidence scores. 
It is noteworthy that the sizes of the provenances, and in turn the 
manual effort necessary, can vary substantially across tuples. Since 
Pall maintains all possible MISets, it is the largest. From the figure, 
we learn that a practical choice for users would be to start exploring 
Pint or Pany-1, then request Pany-k for k > 1, and Puni if 
necessary. 

To gain more insight into the distribution of sizes for individual 
tuples, Figure 3(c) plots the size of each provenance type for the 
top-30 tuples. The size of Pall varies significantly but is almost 
always significantly more than all other provenance types. The two 
cases where Pall coincides with other provenance types are 
examples of output records with unique MISets. The minor variations 
in the sizes of all other forms of provenance are obscured by the 
log-scale for the y-axis. 

6.5 Coverage 

Next we explore the coverage of each provenance model, measured 
as |P*| / |Puni|. Our goal is to determine what fraction of all 
potentially contributing input records is retrieved by any single 
MISet or any arbitrary 3 or 5 MISets, as well as by Pint. Figure 3(d) 
shows the coverage averaged over the set of output records. Pint has 
very low coverage, indicating that very few input records are 
essential in producing any output record; in other words, in most 
cases there are many different explanations for the same output 
record. Pany (for k = 1, 3, 5), on the other hand, retrieves a sizable 
fraction of all contributing input records. This indicates that using 
Pany is a practical way to start debugging: retrieve an initial set of 
input records and, if necessary, request more MISets. 

We treat the coverage study for Pimp as a special case. Since 
the coverage of Pimp depends on the number of ranked tuples 
retrieved, we measure the coverage of the top-k Pimp records 
{r1, ..., rk} using two measures: (1) Record-coverage, measured 
as the fraction of the total number of record appearances of these 
k records in Pall, that is, (c1 + ... + ck) / (c1 + ... + cn), where 
ci denotes the number of MISets containing ri, and Pall contains 
records {r1, ..., rn}. (2) MISet-coverage, measuring the fraction 
of the total number of MISets that contain some tuple in 
{r1, ..., rk}. Figure 3(e) shows these coverages for Pimp; we 
observe that Pimp is very effective in giving very high 
MISet-coverage with very few retrieved records, justifying that 
retrieving few tuples from Pimp can be very useful in debugging, 
with a high representation of almost all MISets. We get high 
incremental value for initial records, with diminishing returns as we 
retrieve more tuples. For record-coverage, the trend is closer to a 
linear increase in coverage. Overall, we observe that Pimp (along 
with Pany and Pint) can be an effective tool for debugging, with the 
caveat that Pimp is computationally more expensive (see 
Section 6.6). An interesting open question is that of efficiently (to 
the extent possible, given our #P-completeness result from 
Section 4) retrieving just a sufficient number of records to meet a 
coverage demand. 
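
The two coverage measures can be illustrated with a short sketch. It assumes Pall is already available as a list of MISets and that the top-k records are simply handed in; how Pimp ranks records is not modeled here, and the function names and sample data are hypothetical.

```python
from collections import Counter

def record_coverage(p_all, top_k):
    """Share of all record appearances in P_all that belong to top_k records."""
    counts = Counter(rec for miset in p_all for rec in miset)
    return sum(counts[rec] for rec in top_k) / sum(counts.values())

def miset_coverage(p_all, top_k):
    """Share of MISets that contain at least one of the top_k records."""
    chosen = set(top_k)
    return sum(1 for miset in p_all if chosen & miset) / len(p_all)

# Hypothetical P_all with four MISets; record "a" appears in three of them.
P_ALL = [
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"a", "d"}),
    frozenset({"e"}),
]

rc = record_coverage(P_ALL, ["a"])
mc = miset_coverage(P_ALL, ["a"])
```

In this toy input a single record already touches three of the four MISets (MISet-coverage 0.75) while accounting for only three of the seven record appearances, which mirrors the gap between the two measures described above.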

6.6 Build time 

Finally, we study the time required to build provenance in 
PROBER, which directly depends on the amount of data fetched. 
Figure 3(f) plots the number of input records fetched for each type 
of provenance, varying the number of high-confidence records. We 
note that Pall, Puni, and Pimp are the most expensive 
computationally, while the amount of data fetched for Pint and Pany 
for k = 1, 3, 5 is significantly less. Since Pall requires a large 
amount of data to be fetched, we studied the behavior of our 
algorithm for finding all MISets when the size of each MISet is 
bounded below 5 (Section 4.3.1). We notice that this is more 
expensive than Pany and Pint but significantly faster than Pall, and 
hence information on the size of each MISet can potentially be 
useful. 

Figure 3: (a) Number of documents fetched, representing the amount of work necessary when using different debugging paradigms. (b) Average size of various 
provenances over 50 tuples. (c) Size of provenance generated for top-30 tuples ranked by confidence scores. (d) Coverage of different provenances with respect to Puni. 
(e) Coverage of Pimp with respect to total number of MISets, and total contribution of all input records. (f) Data fetched to derive various provenance definitions. 

6.7 Evaluation summary 

In conclusion, we established the utility of PROBER over a va- 
riety of relations. MISets pick out minimal sets of input records in 
comparison to other baseline methods thus enabling rapid resolu- 
tion of output records. Our provenance formalisms may substan- 
tially vary in their sizes and we discussed how users may gradually 
move from exploratory provenances to more complete ones. Fi- 
nally, we studied the tradeoff between coverage and execution time 
for various provenance formalisms. 

7. RELATED WORK AND CONCLUSIONS 

Here we present a very brief discussion of related work, with a 
more comprehensive description appearing in Appendix B. Some 
recent work [6, 7, 8, 9, 16] has broadly looked at providing 
exploration phases that enable users to determine if a text database 
is appropriate for an IE task. However, users are provided with little 
or no insight into why unexpected results are produced, and how to 
debug them. Another interesting piece of work [15] presented 
techniques to build IE programs using Datalog for greater readability 
and easier debugging. Our recent work [14] considered debugging 
for iterative IE, and [4] looked at provenance for non-answers in 
results of extracted data. However, these papers assume complete 
knowledge of each operator in some form, such as access to code 
for each operator, or SQL queries applied to input data. Finally, 
there is a large body of work on provenance for relational data 
(refer to [5, 17]), and more recently [3] on understanding provenance 
information. This work does not address the problem of building 
provenance for black-box operators to facilitate IE debugging with 
minimal editorial effort, the primary goal of our work. 

In conclusion, we presented PROBER, the first system for ad-hoc 
debugging of IE pipelines. At the core of PROBER is a 
suitable MISet-based provenance model to link each output record 
with a minimal set of contributing input records. We provided 
efficient algorithms and complexity results for provenance inference, 
and an extensive experimental evaluation on several real-world data 
sets demonstrating the effectiveness of PROBER. A few specific 
directions for future work arise, such as tighter bounds for Puni 
inference, and extending to non-monotonic operators. A more 
general direction of future work we are currently pursuing is to 
develop an interactive GUI for PROBER and perform a user study by 
deploying it for multiple extraction frameworks at Yahoo!. 

APPENDIX 
A. PROOFS 

Proofs of Theorem 4.1, Lemma 4.1, and Lemma 4.2: To prove 
Lemma 4.1, consider the algorithm that attempts to detect 
non-uniqueness of MISets, starting from a given MISet M obtained 
from Algorithm 4.3. Any other MISet M' cannot be a superset of 
M (else it wouldn't be minimal). Therefore, there must exist some 
m ∈ M with m ∉ M', which implies that I − {m} ⊇ M'. Therefore, 
O(I − {m}) must contain r for some m ∈ M if there exists an 
MISet other than M. 

The basic algorithm for Lemma 4.2 is Algorithm 4.3, which finds 
any MISet for a given input I. Finding k MISets is obtained by 
modifying the input I and calling Algorithm 4.3 recursively: 
to find an MISet different from M, for each element m ∈ M, 
Algorithm 4.3 is called with I − {m}. Similarly, given p MISets 
M1, ..., Mp, to find a (p + 1)th MISet M', M' must differ 
from each of M1, ..., Mp. Hence, there must exist a p-tuple 
(m1, ..., mp), mi ∈ Mi, such that M' ⊆ I − {m1, ..., mp}. 
Our algorithm attempts to find an MISet for every such p-tuple. 
Finding the (p + 1)th MISet needs to iterate over |M1| · ... · |Mp| 
p-tuples in the worst case, giving us the required complexity. 

Theorem 4.1 follows based on checking whether O gives a 
unique MISet (Lemma 4.1), then using Lemma 4.2 with k = 1. 
□ 

Proof of Theorem 4.2: We scan the input records i ∈ I, one at a 
time, and apply O to {i} individually. Whenever we have 
o ∈ O({i}), we return Pany(o) = {i}, add {i} to Pall(o), and add 
the element i to Puni(o), initialized to ∅. If |Puni(o)| > 2, we set 
[...], else set Pimp(o) = Puni(o). Finally, Pint can be 
computed from Puni. 

It can be seen easily that provenance for one-to-many operators 
can be computed in a similar fashion. The only difference 
is that the number of output records can now be larger than the 
number of input records, i.e., M may be larger than N. Hence the 
complexity increases to O(M + N). □ 

Proof of Theorem 4.3: We perform O(N) calls to the operator, 
and for each call, we may have to look at an output of size O(M). 
Correctness of the algorithm follows easily: for any record i to be 
in the intersection of all MISets, removing i from the input must 
remove the output record. □ 

Proof of Theorem 4.4: First we prove #P-hardness for a 
many-to-one operator (and hence for an arbitrary operator), 
and then show that the problem is in #P, which applies to 
many-to-one and arbitrary monotonic operators, completing 
our proof. 

1. #P-hardness: We give a reduction from the problem of finding 
all minimal vertex covers. Given a graph G(V, E), our goal 
is to compute all Vmin ⊆ V such that (1) Vmin is a cover: 
each edge e ∈ E has an endpoint in Vmin; (2) Vmin is 
minimal: no proper subset of Vmin is a cover. Given the input 
G(V, E), we create an instance of finding Pall as follows: 
I = V, O = {1}, and our goal is to find all MISets of 1. The 
operator Op takes as input any subset Is ⊆ I, and returns {1} if 
the corresponding set of vertices Vs is a (not necessarily minimal) 
cover of E in G, and returns {0} otherwise. Note that each 
minimal vertex cover of G corresponds to an MISet of 1, and 
each MISet gives a minimal vertex cover. Finally, note that 
our operator generates a single output record, and is therefore 
many-to-one. 

2. #P: Given any Is ⊆ I, we can check in PTIME whether Is is 
an MISet: Is is an MISet if and only if no subset of it obtained 
by removing a single element returns {1}, and 1 ∈ Op(Is). 
Therefore, we can check for all sets in any Pall, whether each 
of them is an MISet. □ 


Proof of Corollary 4.1: Note that the hardness result of 
Theorem 4.4 holds even if our goal was only to count the number of 
minimal vertex covers, or equivalently, find the number of MISets. 
We can translate the problem of counting the number of MISets to 
computing Pimp for a special input tuple i* ∈ I'. Given an 
input (Op, I, o ∈ O) to Pall, we create (Op', I', o ∈ O), where 
I' = I ∪ {i*} and for any Is ⊆ I' we have Op'(Is) = Op(Is) if 
and only if i* ∈ Is, and Op'(Is) = ∅ if i* ∉ Is. Counting the 
number of MISets for O now reduces to the problem of determining 
Pimp for i*. □ 

Proof of Theorem 4.5: The theorem follows directly based on an 
explicit search over all possible inputs of size at most B to find 
Pall. All other P* are subsequently computed using Pall. □ 
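
A minimal sketch of this bounded search, assuming a black-box monotonic operator `op`, distinct input records, and that the empty input does not produce r. The function name `bounded_p_all`, the toy operator, and the way Puni and Pint are derived afterwards are illustrative choices rather than the paper's exact procedure.

```python
from itertools import combinations

def bounded_p_all(op, records, r, bound):
    """All MISets of r with at most `bound` records, by explicit search."""
    misets = []
    for size in range(1, bound + 1):  # smaller candidates are tried first
        for subset in combinations(records, size):
            s = frozenset(subset)
            if any(m <= s for m in misets):
                continue  # contains a smaller MISet, so it is not minimal
            if r in op(s):
                misets.append(s)
    return misets

# Hypothetical monotonic operator: emits "hit" if the input holds both
# "a" and "b", or holds "c".
def toy_op(subset):
    return {"hit"} if {"a", "b"} <= subset or "c" in subset else set()

p_all = bounded_p_all(toy_op, ["a", "b", "c", "d"], "hit", bound=2)
p_uni = frozenset().union(*p_all)                              # union of MISets
p_int = frozenset.intersection(*p_all) if p_all else frozenset()  # intersection
```

With N input records the search examines on the order of N^B subsets, which is polynomial in N for a fixed bound B.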

Proof of Theorem 5.1: The main idea used in the result is that 
the property of MISets for monotonic operators ensures that ∀r : 
Pall(O, I, r ∈ R) for any operator is sufficient to reconstruct (and 
execute) O for any subset Is ⊆ I. Using monotonicity, we know 
that O(Is) ⊆ O(I), hence we only need to determine, for every 
r ∈ R, whether r ∈ O(Is). Using the property of MISets, we 
have r ∈ O(Is) if and only if there is an MISet of r, say Mr, with 
Mr ⊆ Is, allowing us to exactly construct O(Is). 

Given the above fact that Pall enables reconstructing any 
operator, the two expressions in the theorem merely simulate the 
execution of each operator: for a record r2 to be in the output of 
(O2 ∘ O1), some MISet M2 of r2 for O2 must be contained in 
the output of O1. Such an MISet M2 is in the output of O1, i.e., 
M2 ⊆ Os. □ 
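
The simulation argument can be sketched in a few lines. The sketch assumes Pall is stored as a dictionary from each output record to its list of MISets; the names `simulate` and `in_composed_output` and the toy provenance tables are invented for illustration.

```python
def simulate(p_all, subset):
    """Output of an operator on `subset`, rebuilt from its P_all alone:
    r is produced iff some MISet of r is contained in the subset."""
    s = frozenset(subset)
    return {r for r, misets in p_all.items() if any(m <= s for m in misets)}

def in_composed_output(p_all_1, p_all_2, subset, r2):
    """Is r2 in (O2 o O1)(subset)? Some MISet M2 of r2 must lie in O1(subset)."""
    o_s = simulate(p_all_1, subset)
    return any(m2 <= o_s for m2 in p_all_2.get(r2, []))

# Hypothetical provenance: O1 yields "x" from "a" or from "b", and "y" from "c";
# O2 yields "z" only when both "x" and "y" are present.
P_ALL_1 = {"x": [frozenset({"a"}), frozenset({"b"})], "y": [frozenset({"c"})]}
P_ALL_2 = {"z": [frozenset({"x", "y"})]}

with_c = in_composed_output(P_ALL_1, P_ALL_2, {"a", "c"}, "z")
without_c = in_composed_output(P_ALL_1, P_ALL_2, {"a", "b"}, "z")
```

Neither operator is executed here; membership in the composed output is decided purely from the stored MISets, which is the point of the reconstruction argument.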

B. EXPANDED RELATED WORK 

Recognizing the need for a principled approach to assisting IE 
developers and users, several methods have been proposed to 
enable exploration phases. Shen et al. [16] presented an iterative 
approach to developing IE systems, where users begin with an 
"approximate" extraction query. Based on the results of this query, 
users may refine the follow-up query. Jain et al. [9] presented 
a query model for IE tasks for the purpose of exploring whether 
a database is useful for the IE task or not. Following a similar 
spirit, optimization strategies that enable users to efficiently fetch 
IE results with pre-specified output quality (e.g., minimum number 
of good tuples and maximum tolerable bad tuples) have been 
proposed for single IE systems [6, 7] as well as multiple IE 
systems [8]. While such exploration phases enable IE developers to 
assess the quality of an IE system, they mostly focus on answering 
the question, "Is a text database D a good choice for the IE system 
at hand?" Furthermore, users are provided with very little insight 
into why an IE system does not perform as expected. 

Assuming full access and control to the code for each operator 
in an IE pipeline, prior work [15] presented methods to build 
IE programs involving multiple operators using Datalog to generate 
programs that are easy to read and thus easier to debug. Our 
approach considers a generic IE pipeline that may involve any 
arbitrary operators for which we may not have exact specifications 
or access to the code. Recently, [3] addressed the problem of 
understanding "provenance black boxes"; the goal of their work is to 
provide users with an easier way to understand provenance 
information, allowing them to aggregate or drill down on provenance. 
In contrast, our goal is to build a provenance model that is suitable 
for black-box operators in an extraction pipeline. Note that we 
make no direct contribution on user understanding of provenance; 
rather, we produce minimal sets of provenance and their 
compositions in order to quickly understand errors in the data 
produced by the pipeline. 

Figure 4: Sample input output for three steps, namely, sg, ad, and pn, in our extraction pipeline. 

Our prior work [14] presented debugging algorithms for IE tasks; 
our current work substantially differs from and extends this work. 
The techniques in [14] focused on a simple form of IE system, 
namely, iterative IE methods [11]. Furthermore, we assumed we 
had complete knowledge of how each operator was designed and 
exact operator and input-output specifications. Moreover, the 
focus of [14] was to utilize the relatively simple provenance model to 
enable efficient algorithms for explanation, diagnosis, and repair. 
This paper developed a new provenance model for arbitrary 
extraction operators, and presented algorithms for building this 
provenance. Also relevant is the previous work on provenance in [4], 
which addresses the problem of deriving the provenance 
(explanations) for non-answers in extracted data. The paper considers 
conjunctive queries, and for every potential tuple t in an answer to a 
conjunctive query, the authors provide techniques for determining 
updates to base data that would produce t in the output. Once again, 
our work relaxes these assumptions and enables debugging over 
complex IE pipelines consisting of arbitrary black-box operators. 
Finally, there is a large body of previous work on provenance for 
relational databases (refer to [17, 5] for surveys); this work does not 
meet our two-fold requirements of provenance for black-box 
operators, and designing provenance to minimize editorial work during 
debugging. 

C. EXTENSIONS 

In this section, we very briefly discuss the extension of 
PROBER's framework to the absence of records from the output, 
which is particularly useful for non-monotonic operators. We 
emphasize that this section is primarily meant to indicate that 
PROBER is amenable to these extensions. However, we are 
currently developing precise details, and our current system does not 
support these extensions. 

For debugging the absence of records from an output of any 
non-monotonic operator, we may analogously define a notion of 
Maximal Superset (MASet). 

Definition C.1. [MASet] Given an operator Op, its input I 
and output R, we say that Is ⊆ I is a Maximal Superset (MASet) 
of r ∉ R if and only if: (1) r ∈ Op(Is); and (2) ∀I' : I' ⊃ Is, 
I' ⊆ I ⇒ r ∉ Op(I'). □ 

Just as in the case of MISets, it's easy to see that MASets are also 
not unique: 

Example C.l. In Example [TT] if the operator returned "NO" 
whenever there were fewer than 50 records in the input, then the 
MASet of "NO" is any set of 49 input records. □ 



MISets are useful for debugging based on the presence of records in 
the output of an operator, while MASets are useful for debugging 
the absence of records from the output: For every record that is 
output by an operator, its MASet is the entire input, and hence its 
MASet doesn't help in fixing an erroneous output record. However, 
the MISet of an erroneous record points to potentially incorrect input 
that caused the error. Conversely, for a record that is absent from 
the output, MASets help identify what caused the record to be 
omitted from the output. 
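
The MASet notion can be illustrated with a brute-force check on a scaled-down version of the threshold example. This sketch reflects one reading of the definition (the subset produces r, and no strictly larger subset of the input does); the threshold of 3 instead of 50, the four-record input, and the function names are assumptions made to keep the enumeration small.

```python
from itertools import combinations

def is_maset(op, records, subset, r):
    """True if `subset` yields r and no strict superset within `records` does."""
    s = frozenset(subset)
    if r not in op(s):
        return False
    rest = list(frozenset(records) - s)
    for size in range(1, len(rest) + 1):
        for extra in combinations(rest, size):
            if r in op(s | set(extra)):
                return False  # a larger input still yields r
    return True

# Hypothetical non-monotonic operator: answers "NO" on fewer than 3 records.
def threshold_op(subset):
    return {"NO"} if len(subset) < 3 else {"YES"}

RECORDS = ["r1", "r2", "r3", "r4"]
masets_of_no = [
    frozenset(c)
    for size in range(len(RECORDS) + 1)
    for c in combinations(RECORDS, size)
    if is_maset(threshold_op, RECORDS, c, "NO")
]
```

"NO" is absent from the output on the full input, and its MASets are all two-record subsets, mirroring the "any set of 49 input records" example at a smaller scale. For "YES", which is present in the output, the only MASet is the entire input.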



